768 research outputs found

    General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability

    Background: Amongst the most commonly used molecular markers for plant phylogenetic studies are the nuclear ribosomal internal transcribed spacers (ITS). Intra-individual variability of these multicopy regions is a very common phenomenon in plants, the causes of which are debated in the literature. Phylogenetic reconstruction under these conditions is inherently difficult. Our approach is to consider this problem as a special case of the general biological question of how to infer the characteristics of hosts (represented here by plant individuals) from features of their associates (represented here by cloned sequences). Results: Six general transformation functions are introduced, covering the transformation of associate characters to discrete and continuous host characters, and the transformation of associate distances to host distances. A pure distance-based framework is established in which these transformation functions are applied to ITS sequences collected from the angiosperm genera Acer, Fagus and Zelkova. The formulae are also applied to allelic data of three different loci obtained from Rosa spp. The functions are validated by (1) phylogeny-independent measures of treelikeness; (2) correlation with independent host characters; (3) visualization using splits graphs and comparison with published data on the test organisms. The results agree well with these three measures and the datasets examined as well as with the theoretical predictions and previous results in the literature. High-quality distance matrices are obtained with four of the six transformation formulae. We demonstrate that one of them represents a generalization of the Sørensen coefficient, which is widely applied in ecology. Conclusion: Because of their generality, the transformation functions may be applied to a wide range of biological problems that are interpretable in terms of hosts and associates. Regarding cloned sequences, the formulae have a high potential to accurately reflect evolutionary relationships within angiosperm genera, and to identify hybrids and ancestral taxa. These results corroborate earlier ones which showed that treelikeness measures are a valuable tool in comparative studies of biological distance functions.
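The abstract above mentions the Sørensen coefficient without stating it. As a minimal sketch, the classic set-based form can be written as follows; this is the standard ecological coefficient, not the paper's generalization, and the function names and the toy clone labels are assumptions made for illustration only.

```python
def sorensen_similarity(a, b):
    """Classic Sorensen coefficient: 2|A n B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def sorensen_distance(a, b):
    """Dissimilarity derived from the coefficient (1 - similarity)."""
    return 1.0 - sorensen_similarity(a, b)


# Hypothetical hosts described by the sets of associates found in them.
host_x = {"clone1", "clone2", "clone3"}
host_y = {"clone2", "clone3", "clone4", "clone5"}

similarity = sorensen_similarity(host_x, host_y)  # 2 * 2 / (3 + 4)
distance = sorensen_distance(host_x, host_y)
```

Here each host is reduced to the set of its associates, so two hosts sharing all associates get distance 0 and two hosts sharing none get distance 1.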

    VICTOR: genome-based phylogeny and classification of prokaryotic viruses

    Motivation: Bacterial and archaeal viruses are crucial for global biogeochemical cycles and might well be game-changing therapeutic agents in the fight against multi-resistant pathogens. Nevertheless, it is still unclear how to best use genome sequence data for a fast, universal and accurate taxonomic classification of such viruses. Results: We here present a novel in silico framework for phylogeny and classification of prokaryotic viruses, in line with the principles of phylogenetic systematics, and using a large reference dataset of officially classified viruses. The resulting trees revealed a high agreement with the classification. Except for low resolution at the family level, the majority of taxa were well supported as monophyletic. Clusters obtained with distance thresholds chosen for maximizing taxonomic agreement appeared phylogenetically reasonable, too. Analysis of an expanded dataset, containing >4000 genomes from public databases, revealed a large number of novel species, genera, subfamilies and families. Availability and implementation: The selected methods are available as the easy-to-use web service ‘VICTOR’ at https://victor.dsmz.de. Supplementary information: Supplementary data are available at Bioinformatics online.

    Controlling false discoveries in high-dimensional situations: Boosting with stability selection

    Modern biotechnologies often result in high-dimensional data sets with many more variables than observations (n ≪ p). These data sets pose new challenges to statistical analysis: Variable selection becomes one of the most important tasks in this setting. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds a finite sample error control to high-dimensional variable selection procedures such as Lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provides insights into the usefulness of this combination. Limitations are discussed and guidance on the specification and tuning of stability selection is given. The interpretation of the used error bounds is elaborated and insights for practical data analysis are given. The results will be used to detect differentially expressed phenotype measurements in patients with autism spectrum disorders. All methods are implemented in the freely available R package stabs.
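The resampling idea behind stability selection can be sketched roughly as below. This is not the stabs package and not boosting: a simple "top q predictors by absolute correlation" rule stands in for the base selector, and all function names, parameter values and the simulated data are assumptions for illustration.

```python
import numpy as np


def top_q_by_correlation(X, y, q):
    """Stand-in base selector: the q predictors most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf  # constant columns get correlation 0
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[-q:]


def stability_selection(X, y, q=5, n_subsamples=100, threshold=0.6, seed=0):
    """Selection frequencies over random half-samples of the observations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts[top_q_by_correlation(X[idx], y[idx], q)] += 1
    freq = counts / n_subsamples
    # Keep variables selected in at least `threshold` of the subsamples.
    return np.flatnonzero(freq >= threshold), freq


# Simulated n << p data: only the first two predictors carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))
y = 5 * X[:, 0] - 5 * X[:, 1] + rng.normal(size=100)
selected, freq = stability_selection(X, y)
```

A variable is kept only if the base selector picks it in a large share of the half-samples, which is what makes the result less sensitive to any single fit. The error bounds mentioned in the abstract relate q, the threshold and p; they are not computed in this sketch.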

    TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy

    Microbial taxonomy is increasingly influenced by genome-based computational methods. Yet such analyses can be complex and require expert knowledge. Here we introduce TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy, connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information. It infers genome-scale phylogenies and state-of-the-art estimates for species and subspecies boundaries from user-defined and automatically determined closest type genome sequences. TYGS also provides comprehensive access to nomenclature, synonymy and associated taxonomic literature. Clinically important examples demonstrate how TYGS can yield new insights into microbial classification, such as evidence for a species-level separation of previously proposed subspecies of Salmonella enterica. TYGS is an integrated approach for the classification of microbes that unlocks novel scientific approaches to microbiologists worldwide and is particularly helpful for the rapidly expanding field of genome-based taxonomic descriptions of new genera, species or subspecies.

    Maximum Likelihood Analyses of 3,490 rbcL Sequences: Scalability of Comprehensive Inference versus Group-Specific Taxon Sampling

    The constant accumulation of sequence data poses new computational and methodological challenges for phylogenetic inference, since multiple sequence alignments grow in both the horizontal (number of base pairs, phylogenomic alignments) and the vertical (number of taxa) dimension. Leaving aside the ongoing controversy about appropriate models, partitioning schemes, and assembly methods for phylogenomic alignments, and the high computational cost of inferring these, for many organismic groups a sufficient number of taxa is often available from only one or a few genes (e.g., rbcL, matK, rDNA). In this paper we address the scalability of Maximum-Likelihood-based phylogeny reconstruction with respect to the number of taxa, using as an example several large nested single-gene rbcL alignments comprising 400 up to 3,491 taxa. In order to test the effect of taxon sampling, we employ an appropriately adapted taxon jackknifing approach. In contrast to standard jackknifing, this taxon subsampling procedure is not conducted entirely at random, but is based on drawing subsamples from empirical taxon groups, which can either be user-defined or determined by using taxonomic information from databases. Our results indicate that, despite an unfavorable ratio of number of sequences to number of base pairs, i.e., many relatively short sequences, Maximum Likelihood tree searches and bootstrap analyses scale well on single-gene rbcL alignments with a dense taxon sampling of up to several thousand sequences. Moreover, the newly implemented taxon subsampling procedure can be beneficial for inferring higher-level relationships and interpreting bootstrap support from comprehensive analyses.
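The group-based taxon subsampling described above could look roughly like the following sketch. The abstract does not give the exact drawing scheme, so keeping a fixed fraction of every group (with at least one taxon per group) is an assumption, as are the function name and the toy groups.

```python
import random


def group_jackknife(groups, fraction=0.5, seed=None):
    """Draw a taxon subsample group by group instead of fully at random.

    `groups` maps a group name to its taxa. Each group keeps about
    `fraction` of its taxa and always at least one, so that every
    group stays represented in the subsample.
    """
    rng = random.Random(seed)
    subsample = []
    for name in sorted(groups):
        taxa = sorted(groups[name])
        k = max(1, round(fraction * len(taxa)))
        subsample.extend(rng.sample(taxa, k))
    return subsample


# Hypothetical taxon groups, e.g. derived from a taxonomy database.
groups = {
    "Fagales": ["Fagus", "Quercus", "Betula", "Alnus"],
    "Rosales": ["Rosa", "Zelkova"],
    "Sapindales": ["Acer"],
}
subsample = group_jackknife(groups, fraction=0.5, seed=42)
```

Repeating the draw with different seeds yields replicate alignments in which every group remains present, which standard random taxon deletion does not guarantee.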

    Phylogenies from whole genomes: Methodological update within a distance-based framework

    Methods which derive pairwise distances directly from complete sequenced genomes are a potentially important and efficient tool within the growing field of phylogenomics. We have shown in two previous studies that the Genome BLAST Distance Phylogeny (GBDP) approach leads to reliable phylogenetic estimates if applied to prokaryotic as well as plastid and mitochondrial genomes. Basically, GBDP first invokes tools such as BLAST to identify high-scoring segment pairs (HSPs) between all pairs of genomes; afterwards, pairwise distances are estimated based on different formulae. Here, we examine (1) a new GBDP distance formula, based on a combination of two previously existing ones; (2) use of BLAT instead of BLASTN and TBLASTX HSP search; (3) an alternative measure for the agreement of a distance matrix with a predefined reference topology; (4) alternative topology-independent measures of distance quality per se. All examinations were based on an enlarged dataset compared to that used in our previous study, additionally containing interesting key taxa.
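To make the second GBDP step more concrete, here is a minimal sketch of turning a set of HSPs into a pairwise distance. The abstract does not state the formulae, so the coverage-style distance used here (one minus the fraction of both genomes covered by HSPs) is an assumption chosen for illustration, as are the function names, the half-open coordinate convention and the toy HSP list.

```python
def covered_length(intervals):
    """Total length covered by half-open intervals [start, end)."""
    total = 0
    end_prev = None
    for start, end in sorted(intervals):
        if end_prev is None or start > end_prev:
            total += end - start      # interval starts a new covered block
            end_prev = end
        elif end > end_prev:
            total += end - end_prev   # interval extends the current block
            end_prev = end
    return total


def coverage_distance(hsps, len_x, len_y):
    """Illustrative distance: 1 - (covered_x + covered_y) / (len_x + len_y).

    Each HSP is a tuple (x_start, x_end, y_start, y_end).
    """
    cov_x = covered_length([(h[0], h[1]) for h in hsps])
    cov_y = covered_length([(h[2], h[3]) for h in hsps])
    return 1.0 - (cov_x + cov_y) / (len_x + len_y)


# Hypothetical HSPs between a 1000 bp and an 800 bp genome.
hsps = [(0, 100, 10, 110), (50, 150, 60, 160), (300, 400, 500, 600)]
d = coverage_distance(hsps, len_x=1000, len_y=800)
```

Overlapping HSPs are merged before counting so that a region matched several times does not inflate the coverage.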

    Implications of molecular characters for the phylogeny of the Microbotryaceae (Basidiomycota: Urediniomycetes)

    BACKGROUND: Anther smuts of the basidiomycetous genus Microbotryum on Caryophyllaceae are important model organisms for many biological disciplines. Members of Microbotryum are most commonly found parasitizing the anthers of host plants in the family Caryophyllaceae; however, they can also be found on the anthers of members of the Dipsacaceae, Lamiaceae, Lentibulariaceae, and Portulacaceae. Additionally, some members of Microbotryum can be found infecting other organs of mainly Polygonaceae hosts. Based on ITS nrDNA sequences of members of almost all genera in Microbotryaceae, this study aims to resolve the phylogeny of the anther smuts and their relationship to the other members of the family of plant parasites. A multiple analysis strategy was used to correct for the effects of different equally possible ITS sequence alignments on the phylogenetic outcome, which appears to have been neglected in previous studies. RESULTS: The genera of Microbotryaceae were not clearly resolved, but alignment-independent moderate bootstrap support was achieved for a clade containing the majority of the Microbotryum species. The anther parasites appeared in two different well-supported lineages whose interrelationship remained unresolved. Whereas bootstrap support values for some clades were highly vulnerable to alignment conditions, other clades were more robustly supported. The differences in support between the different alignments were much larger than between the phylogenetic optimality criteria applied (maximum parsimony and maximum likelihood). CONCLUSION: The study confirmed, based on a larger dataset than previous work, that the anther smuts on Caryophyllaceae are monophyletic and that there exists a native North American group that diverged from the European clade before the radiation of the European species. Also a second group of anther smuts was revealed, containing parasites on Dipsacaceae, Lamiaceae, and Lentibulariaceae. At least the majority of the parasites of Asteraceae appeared as a monophylum, but delimitations of some species in this group should be reconsidered. Parasitism on Polygonaceae is likely to be the ancestral state for the Microbotryaceae on Eudicot hosts.

    Genome sequence-based species delimitation with confidence intervals and improved distance functions

    Background: For the last 25 years species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy to generate genome sequences for delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in the implementation of such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible to ensure consistency in the prokaryotic species concept. Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features such as confidence intervals for intergenomic distances obtained via resampling or via the statistical models for DDH prediction and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions. Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.